|
|
|
|
|
|
|
|

| Disclaimer Information in this document is provided in connection with Intel products. No license under any patent or copyright is granted expressly or implied by this publication. Intel assumes no liability whatsoever, including infringement of any patent or copyright, for sale and use of Intel products except as provided in Intel's Terms and Conditions of Sale for such products. Intel retains the right to make changes to these specifications at any time, without notice. Microcomputer Products may have minor variations to their specifications known as errata. MPEG is an international standard for audio and video compression and decompression promoted by ISO. Implementations of MPEG CODEC’s or MPEG enabled platforms may require licenses from various entities, including Intel Corporation. Intel makes no representation as to the need for licenses from any entity. No licenses, either express, implied or by estoppel are granted herein. For information on licensing Intel patents, please contact Intel. |
|
1.0 INTRODUCTION
|
The media extensions to the Intel Architecture (IA) instruction set include single-instruction, multiple-data (SIMD) instructions. This document describes an MMX technology implementation of a procedure to perform an absolute difference on a 16x16 block of pixels. This procedure can be an integral part of a motion estimation kernel, as will be described.
An important technique used in video compression is to try and predict movement between consecutive frames. In many cases a moving object stays the same from frame to frame and only moves across the viewing field. A substantially better compression ratio can be achieved by producing displacement vectors of the object from frame to frame, instead of compressing the object for every frame. Calculating these vectors is called motion estimation and requires the calculation of an absolute difference between blocks of the frames.
The motion estimation function sums the absolute differences (or squared differences) between the pixel values of two different 16x16 blocks, and finds the best match. In MPEG1 for example, the calculation can be made in four ways. Either the absolute differences between the pixel values is summed (L1 norm) or the square of the differences (L2 norm). Orthogonal to the difference equation, this sum can be accumulated with respect to a reference block that has been shifted either by some number of whole pixel positions or by some number of half-pixel positions.
A C Language code example for a 16x16 pixel full pel motion estimation using absolute differences is given in Example 1. The code has a fast-out so that if the difference accumulated across some rows is more than the current best match, it aborts the rest of the absolute difference.
char *bptr; /* pointer to start row of 16x16 pixel block being compressed.*/
char *cptr; /* pointer to start row of 16x16 pixel reference block.*/
val=0;
for(i=0;i<16;i++)
{
for(j=0;j<16;j++)
{
data=(*(bptr++)-*(cptr++));
if (data<0) {val-=data;}
else {val+=data;}
}
/* Fast out after this row if we've exceeded best match*/
if (val > best_value) break;
/* update pointer to next row*/
bptr += (rm->width - 16);
/* update pointer to next row */
cptr += (cm->width - 16);
}
The flow of the motion estimation inner loop code when using MMX instructions is shown in Figure 1. The operation uses a PSUBUSB (packed-byte-subtract-unsigned-with-saturation) to generate the absolute differences without requiring 16-bit precision to perform the operation.
Usually, subtraction of two 8-bit unsigned numbers produces a 9-bit signed result. Thus, in order to keep the precision it may seem that a conversion to 16 bits is needed before the subtract operation. This becomes unnecessary by using the PSUBUSB instruction. By subtracting source 2 from source 1 and then doing the opposite operation on a copy of one of the sources, each result register has an absolute difference value. This is only true when an element in source 1 is larger than an element in source 2; if not, the result is zero because a negative result saturates to zero. The same thing happens for the opposite subtraction. This takes care of generating the absolute differences for the cases in which the elements in source 2 are larger than the elements in source 1.
Using the OR command on the results of the two subtractions generates the eight desired absolute difference results. This enables us to find the absolute difference in byte precision substantially faster than if done in 16 bit precision.

The MMX code for the inner-most computation of absolute differences is presented in Example 2. This code does not have the fast out option as does the C Language code of Example 1. In general, a fast out is not relevant after each 16x1 estimation as the overhead cost of adding the fast out for every iteration of the MMX instruction loop is prohibitive.
estim Proc C Public uses m1:DWORD, m2:DWORD
movq mm0, [m1]
movq mm1, [m2]
movq mm2, mm0
movq mm3, 8[m1]
psubusb mm0, mm1
psubusb mm1, mm2
movq mm4, 8[m2]
por mm0, mm1
movq mm5, mm3
movq mm1, mm0
psubusb mm3, mm4
punpckbw mm0, mm6
psubusb mm4, mm5
psrlq mm1, 32
por mm3, mm4
punpckbw mm1, mm6
movq mm4, mm3
punpckbw mm3, mm6
paddw mm7, mm0
psrlq mm4, 32
paddw mm7, mm1
punpckbw mm4, mm6
paddw mm7, mm3
paddw mm7, mm4
ret
estim EndP
Note that the misalignment penalty of memory accesses
is an important factor in the motion estimation loop. Thus, minimizing
the number of memory accesses is very important. For this reason,
it is faster to load the data once and copy it to avoid destruction
inherent to the IA two-operand model than to base the code on
MMX instructions with one of the operands from memory.
It is possible to perform a version of the absolute difference using the same technique as the MMX instructions, but this can only be done on one data element at a time instead of utilizing the parallelism that MMX instructions can provide. This results in more optimized scalar code and is listed below in Example 3.
estim Proc C Public uses m1: DWORD, m1:DWORD
pushl esi
pushl edi
pushl ebx
pushl m1
pushl m2
popl edi
popl esi
xorl ecx, ecx
xorl edx, edx
xorl ebx, ebx
xorl eax, eax
movb bl, [esi]
movb cl, [edi]
subl ecx, ebx
movb bl, [esi]
xorb cl, ch
movb dl, 1[edi]
subb cl, ch
subl edx, ebx
andl ecx, 0xff
xorb dl, dh
addl eax, ecx
subb dl, dh
andl edx, 0xff
movb bl, 2[esi]
addl eax, edx
movb cl, 2[edi]
subl ecx, ebx
movb bl, 3[esi]
xorb cl, ch
movb dl, 3[edi]
subb cl, ch
subl edx, ebx
andl ecx, 0xff
xorb dl, dh
addl eax, ecx
subb dl, dh
andl edx, 0xff
movb bl, 4[esi]
addl eax, edx
movb cl, 4[edi]
subl ecx, ebx
movb bl, 5[esi]
xorb cl, ch
movb dl, 5[edi]
subb cl, ch
subl edx, ebx
andl ecx, 0xff
xorb dl, dh
addl eax, ecx
subb dl, dh
andl edx, 0xff
movb bl, 6[esi]
addl eax, edx
movb cl, 6[edi]
subl ecx, ebx
movb bl, 7[esi]
xorb cl, ch
movb dl, 7[edi]
subb cl, ch
subl edx, ebx
andl ecx, 0xff
xorb dl, dh
addl eax, ecx
subb dl, dh
andl edx, 0xff
movb bl, 8[esi]
addl eax, edx
movb cl, 8[edi]
subl ecx, ebx
movb bl, 9[esi]
xorb cl, ch
movb dl, 9[edi]
subb cl, ch
subl edx, ebx
andl ecx, 0xff
xorb dl, dh
addl eax, ecx
subb dl, dh
andl edx, 0xff
movb bl, 10[esi]
addl eax, edx
movb cl, 10[edi]
subl ecx, ebx
movb bl, 11[esi]
xorb cl, ch
movb dl, 11[edi]
subb cl, ch
subl edx, ebx
andl ecx, 0xff
xorb dl, dh
addl eax, ecx
subb dl, dh
andl edx, 0xff
movb bl, 12[esi]
addl eax, edx
movb cl, 12[edi]
subl ecx, ebx
movb bl, 13[esi]
xorb cl, ch
movb dl, 13[edi]
subb cl, ch
subl edx, ebx
andl ecx, 0xff
xorb dl, dh
addl eax, ecx
subb dl, dh
andl edx, 0xff
movb bl, 14[esi]
addl eax, edx
movb cl, 14[edi]
subl ecx, ebx
movb bl, 15[esi]
xorb cl, ch
movb dl, 15[edi]
subb cl, ch
subl edx, ebx
andl ecx, 0xff
xorb dl, dh
addl eax, ecx
subb dl, dh
andl edx, 0xff
popl ebx
addl eax, edx
popl edi
popl esi
ret
estim EndP
We compared the MMX technology technique to the C Language implementation as well as to the optimized scalar version. On a Pentium® processor implementation, the performance improvements are as follows:
.586
include MMX .inc
ASSUME ds:FLAT, cs:FLAT, ss:FLAT
_TEXT SEGMENT DWORD PUBLIC USE32 'CODE'
_TEXT ENDS
_TEXT SEGMENT DWORD PUBLIC USE32 'CODE'
; basic flow outlined in appnote.
; performing saturated subtractions and merging results of differences (por).
; unpacking is then performed to 16-bit precision to accumulate differences without
; overflow. Accumulation into mm7 is then done and returned in mm7.
estim Proc Near C Public m1:PTR DWORD, m2:PTR DWORD
movq mm0, DWORD PTR [m1] ; grab 1st half of row from bptr
movq mm1, DWORD PTR [m2] ; grab 1st half of row from cptr
movq mm2, mm0 ; make a copy for abs diff operation
movq mm3, DWORD PTR 8[m1] ; grab 2nd half of row from bptr
psubusb mm0, mm1 ; do subtraction one way (w/saturation)
psubusb mm1, mm2 ; do subtraction the other way (w/saturation)
movq mm4, DWORD PTR 8[m2] ; grab 2nd half of row from cptr
por mm0, mm1 ; merge results of 1st half
movq mm5, mm3 ; make a copy for abs diff operation
movq mm1, mm0 ; keep a copy
psubusb mm3, mm4 ; do subtraction one way (w/saturation)
punpcklbw mm0, mm6 ; unpack to higher precision for accumulation
psubusb mm4, mm5 ; do subtraction the other way (w/saturation)
psrlq mm1, 32 ; shift results for accumulation
por mm3, mm4 ; merge results of 2nd half
punpcklbw mm1, mm6 ; unpack to higher precision for accumulation
movq mm4, mm3 ; keep a copy
punpcklbw mm3, mm6 ; unpack to higher precision for accumulation
paddw mm7, mm0 ; accumulate difference...
psrlq mm4, 32 ; shift results for accumulation
paddw mm7, mm1 ; accumulate difference...
punpcklbw mm4, mm6 ; unpack to higher precision for accumulation
paddw mm7, mm3 ; accumulate difference...
paddw mm7, mm4 ; accumulate difference...
ret
estim EndP
_TEXT ENDS
END